Abstract. This paper looks at how the Hopfield neural network can be used to store and 
recall patterns constructed from natural language sentences. As a pattern recognition and 
storage tool, the Hopfield neural network has received much attention. This attention, 
however, has been mainly in the field of statistical physics due to the model's simple 
abstraction of spin glass systems. 

We discuss the differences, expressed as bias and correlation, between natural 
language sentence patterns and the randomly generated ones used in previous 
experiments. Results are given for numerical simulations which show the auto-associative 
competence of the network when trained with natural language patterns. 



1 Introduction 

As a pattern recognition and storage tool, the Hopfield network has received much 
attention. In particular the discrete model has been widely investigated in the field of statistical 
physics because of its simple abstraction of a spin glass system. 

This paper looks at how the Hopfield memory can be used to store and recall patterns which 
represent natural language sentences. It will be shown that these patterns behave differently to 
those randomly generated ones used in previous experiments. 

The feature of natural language patterns which we describe in this paper is that the networks 
they form have very low activation levels. This introduces complications from a computational 
point of view in the form of non-zero mean noise, which unchecked would make recall impossible. 
From an AI perspective we note that low activation is an intrinsic feature of the parts of the 
brain dealing with concept association. 

Firstly though, why use the Hopfield network in NLP? The goal of our research is to adapt 
associative neural networks for use in context related NLP tasks such as word sense disam- 
biguation and lexical transfer in machine translation. It is generally agreed that contextual 
knowledge plays an important role in the processing of language by people, but the complexity 
of word relations which together represent 'context' has defied analysis and prohibited progress 
towards context driven processing in NLP. Our intention is to use the neural network as a 
sophisticated post-processing device in which contextual knowledge can be utilised through a 
multiplicity of word cooccurrence relations in a fully connected network. This contrasts with 
traditional statistical NLP methods, which tend to model context 
through local surface word cooccurrence n-grams. 

One of the weaknesses of such statistical methods is in their localist treatment of context 
where only cooccurrences in a narrow window, usually covering a single sentence, contribute to 
the modelling of context. This ignores the fact that much contextual knowledge may be outside 
the sentence in which a word occurs. We refer to such external contextual relations as 'global 
context' and our network approach seeks to capture this through indirect word association 
relations and to process it efficiently. 

Our basic criteria for choosing a connectionist approach are (a) automatic knowledge acqui- 
sition, (b) avoidance of combinatorial explosion, and (c) knowledge transparency. 

We view the need for machine learning of language from examples and a self-organising 
memory as crucial to large scale NLP. In this respect we agree with other so called 'bottom- 
up' paradigms such as statistical NLP and example-based NLP which are based on automatic 
knowledge acquisition. Unlike statistical NLP however, we see that the mathematical tools 
for measuring surface word associations in large corpora will yield only rough approximations 
for most of the interesting word relations and look for a better way to process the statistical 
knowledge with all its inaccuracies. 

Combinatorial explosion is a problem for large scale NLP with most paradigms and makes 
scaling up of systems difficult. We hope to be able to avoid this by using a model in which 
the number of word relations, as measured by simple cooccurrences in sentences, does not affect 
efficiency in terms of processing times. 

Transparency of the knowledge base is our final basic criterion and means that the 
representation of linguistic knowledge should be analysable and easily interpreted by non-experts in 
connectionism. This should aid verification of results. 

Neural networks are thought of as being 'black boxes' in which knowledge is so heavily 
encoded that it defies inspection. We will be using a localist storage method in which seman- 
tic transparency of language is preserved. Additionally, the Hopfield network has dynamics 
which are mathematically tractable and understood within the limits imposed by statistical 
physics. We can therefore apply both a linguistic as well as a mathematical explanation to our 
results. This contrasts with previous work in connectionist NLP such as Ide and Veronis 
where the networks used are structured and incomplete and have no well understood theoretical 
framework. 

In addition to our basic criteria we look particularly to neural networks to provide gen- 
eralisation. This is a crucial capability in any paradigm which processes language because we 
cannot expect to train our models on the complete set of examples which we will encounter. The 
generalisation capability manifests itself in the network being able to learn relationships which 
were not linear in the training set. We hope to exploit this function and develop a practical 
connectionist alternative to other bottom-up NLP paradigms. 

As a first step in an ongoing series of experiments we intend to explore the basic functionality 
of the Hopfield network for use in storing natural language sentences. This is important because 
it establishes the basic properties of the network and provides a foundation for future work in 
this area. In the future we want to develop the model to allow multi-word sense disambiguation 
in sentences using large-scale contextual knowledge derived from corpus statistics. 



2 Sentence Patterns 



Previous work in statistical physics has looked at the storage of randomly generated bit vectors 
in which the bits all have equal probabilities of being 1 or 0. For many real world tasks such 
as NLP or vision processing, we would like to store non-random patterns where the bits do not 
occur with equal likelihood. It is therefore natural to explore the properties of the model for 
storing non-random patterns in our domain of interest. 

The training vectors we use can be regarded as a set of n patterns $\xi^{\mu}$ for ($\mu = 1, \ldots, n$) 
representing the n sentences we wish to store. Each pattern consists of N bits with each bit 
taking a value of 0 or 1, where $\xi_i^{\mu} = 1$ ($\xi_i^{\mu} = 0$) represents the presence (absence) of a word at 
index i in a lexicon in pattern (and sentence) $\mu$. In our localist representation each unit in the 
Hopfield network derived from these training patterns similarly corresponds to one word in the 
lexicon. 

The representation allows us to model linguistic properties of language which are dependent 
on frequency and cooccurrence of words. With enough training data we can capture useful 
information about word contexts. 
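
As an illustration of this representation, the following minimal Python sketch builds a lexicon from a handful of sentences and encodes each sentence as a 0/1 pattern. The toy sentences and the function names `build_lexicon` and `encode` are hypothetical and are not taken from the original experiments; the sentences are assumed to be already stripped of function words.

```python
def build_lexicon(sentences):
    """Map each distinct word to a unit index, in order of first appearance."""
    lexicon = {}
    for sentence in sentences:
        for word in sentence:
            lexicon.setdefault(word, len(lexicon))
    return lexicon


def encode(sentence, lexicon):
    """Return the 0/1 pattern for one sentence: bit i is 1 if word i is present."""
    pattern = [0] * len(lexicon)
    for word in sentence:
        pattern[lexicon[word]] = 1  # repeated words still set the bit only once
    return pattern


# Hypothetical toy corpus of content words (not from the Asahi corpus).
sentences = [
    ["government", "announced", "budget"],
    ["budget", "debate", "parliament"],
    ["government", "parliament", "vote", "vote"],
]
lexicon = build_lexicon(sentences)
patterns = [encode(s, lexicon) for s in sentences]
for p in patterns:
    print(p)
```

Because each unit stands for exactly one word, a pattern can be read back directly as the set of words it contains, which is the sense in which the representation is transparent.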

Several linguistic factors make natural language patterns different to those generated ran- 
domly. These are: 

1. Bias: Words in the lexicon do not occur with the same distribution. Technical vocabulary 
and proper names especially tend to have a very low frequency, even when the training 
corpus is very large. 

2. Internal Correlation: Syntactic and semantic factors mean that the probability distributions 
of two words appearing in the same sentence are not independent. 

3. External Correlation: Pragmatic factors mean that words and phrases which appear before 
others in a continuous and coherent piece of text influence the probability distributions of 
those which come later. 

It is beyond the scope of this work to calculate a macro-statistic for correlation between 
patterns as described above. We can however think of other measures which reflect part of this 
information. These are outlined below. 

Bias is a measure of how likely a bit in the training set $\xi^{\mu}$ is to be 1. In our model this is 

$$\Pr(\xi_i^{\mu} = 1) = \frac{1}{nN} \sum_{\mu=1}^{n} \sum_{i=1}^{N} \xi_i^{\mu} \qquad (1)$$

This statistic serves only as a gross summary and does not show the correlation features of 
patterns which were discussed above. Nevertheless it is simple to calculate and shows us something 
of the nature of the patterns we are storing. Clearly for a network storing unbiased patterns 
$\Pr(\xi_i^{\mu} = 1)$ will be 0.5. 
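
A minimal sketch of this statistic is given below. It assumes that bias is computed as the overall fraction of bits set to 1 across all training patterns, which matches the description of bias as a gross summary; the function name `bias` and the toy patterns are hypothetical.

```python
def bias(patterns):
    """Fraction of bits set to 1 over all patterns (a gross summary statistic)."""
    total_bits = sum(len(p) for p in patterns)
    return sum(sum(p) for p in patterns) / total_bits


# Sparse toy patterns: few 1 bits per pattern, as with sentence patterns.
sparse = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
]
# Patterns with as many 1 bits as 0 bits, as in the unbiased random case.
unbiased = [[1, 0, 1, 0], [0, 1, 1, 0]]

print(bias(sparse))
print(bias(unbiased))
```

The sparse patterns give a bias well below 0.5, while the second set gives exactly 0.5.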

Pattern recognition studies which use randomly generated patterns have for practical reasons 
assumed that bias is the same for all bits of all patterns in all contexts. For natural language 
patterns we should not make such an assumption because the frequency of bits is directly linked 
to the distribution of words in a corpus. 

Previous work has shown that connectivity is also an important factor in 
determining the network's behaviour. We will define mean connectivity informally as the mean 
number of different words which any single word cooccurs with in a sentence. We will define 
this more formally in the next section. 



3 The Model 



The discrete Hopfield model which we explore as the basis for our work is a fully connected 
network of N units, where the synaptic connection strengths are held in a 'weight' matrix T 
with $T_{ij}$ representing the weight on the symmetrical arc between units i and j. The output 
from a unit i is $V_i$ and comes from internal and external sources with the internal inputs 

$$H_i = \sum_{j \neq i}^{N} T_{ij} V_j - U_i \qquad (2)$$

where $U_i$ is a threshold. The external input $I_i$ is calculated and set at the start of processing. 
Note that self interaction of a unit is prohibited. 

In the version of the network which we use, stored patterns are recalled using a recall 
prescription 

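$$V_i = \begin{cases} 0 & \text{if } N_i < U_i \\ 1 & \text{if } N_i > U_i \end{cases} \qquad (3)$$

where 

$$N_i = \sum_{j \neq i}^{N} T_{ij} V_j + I_i \qquad (4)$$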

The operation of individual units in the network is quite simple as we can see from Eqn. (3), 
where a weighted sum of inputs from all other units determines whether the unit outputs a 1 
or a 0. This disguises the fact that the collective behaviour of a system of such fully connected 
units is quite complex. 
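
The behaviour of a single unit can be sketched as follows. This is an illustrative Python fragment with hypothetical names and toy weights; since the recall prescription does not say what happens when the net input equals the threshold, the sketch assumes that the unit then keeps its current output.

```python
def net_input(T, V, I, i):
    """Weighted sum of the other units' outputs plus the external input to unit i."""
    return sum(T[i][j] * V[j] for j in range(len(V)) if j != i) + I[i]


def update_unit(T, V, I, U, i):
    """Threshold rule for unit i. Assumption: output is unchanged on an exact tie."""
    n_i = net_input(T, V, I, i)
    if n_i > U[i]:
        return 1
    if n_i < U[i]:
        return 0
    return V[i]


# Toy symmetric weights with a zero diagonal (hypothetical values).
T = [[0, 2, -1],
     [2, 0, -1],
     [-1, -1, 0]]
V = [1, 0, 1]          # current outputs
I = [0, 0, 0]          # external inputs
U = [0.5, 0.5, 0.5]    # constant thresholds

print([update_unit(T, V, I, U, i) for i in range(3)])
```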

Training the network occurs by bringing into correspondence the patterns we wish to store, 
$\xi^{\mu}$ for ($\mu = 1, \ldots, n$), and stable states in the network's dynamics, called nominated states. 
At the same time we want to avoid creating spurious stable states which do not correspond to 
any of the training patterns. 

Storage is effected through the weight matrix T and the threshold vector U. Although there 
are more effective storage prescriptions (e.g. see Tarassenko et al. [20]), we have chosen to use 
the Hebb rule 

$$T_{ij} = \sum_{\mu=1}^{n} \xi_i^{\mu} \xi_j^{\mu} \qquad (5)$$

and to set all the elements in the threshold vector, U, to a constant which is calculated at 
the start of processing. Using the Hebb rule makes our results compatible with earlier work in 
statistical physics and is also computationally convenient when computing the weight matrix. 

For storage to be guaranteed to take place, T must be both symmetrical and have a zero 
diagonal, so we introduce the additional rule 



$$T_{ii} = 0 \qquad (6)$$

The Hebb rule (5) for finding the weights between two units (words) in the network 
intuitively corresponds to the frequency of cooccurrence of the words in the training sentences, 
ignoring multiple cooccurrence in the same sentence. This simple relation between the training 
data and the representation ensures semantic transparency. We can also see that learning of 
sentences in the approach we have outlined here is both automatic and self-organizing, in that 
we do not decide a priori which word relations are to remain and which are to be ignored. 
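
The storage step can be sketched in Python as below. The sketch assumes that the Hebb rule takes the form of a sum of products of 0/1 bits over the training patterns, so that each weight is a sentence cooccurrence count, and that the diagonal is held at zero; the function name `hebb_weights` and the toy patterns are hypothetical.

```python
def hebb_weights(patterns):
    """Cooccurrence-count weights: T[i][j] = number of patterns where bits i and j are both 1."""
    N = len(patterns[0])
    T = [[0] * N for _ in range(N)]
    for p in patterns:
        active = [i for i, bit in enumerate(p) if bit == 1]
        for i in active:
            for j in active:
                if i != j:          # keep the diagonal at zero
                    T[i][j] += 1
    return T


# Hypothetical toy patterns over a six-word lexicon.
patterns = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0],
]
T = hebb_weights(patterns)
for row in T:
    print(row)
```

Here words 0 and 1 appear together in two sentences, so their weight is 2, while word pairs that never share a sentence keep a weight of 0.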

Unfortunately, mean field analysis by Amit has predicted that using the storage equation 
with biased patterns will lead to noise overwhelming signal and the destabilising of 
nominated states. Rather than use a non-localist storage prescription such as the projection matrix 
of Kohonen [16] or Personnaz et al. [19] we intend to compensate for local noise by implementing 
a global inhibitor. This is inspired by comments in Buhmann and Schulten as well as Amit 
and allows us to stabilise states corresponding to nominated patterns by compensating for 
the noise. In this way we have replaced elements $T_{ij} = 0$ where $i \neq j$ with $T_{ij} = -10/N$. The 
constant 10 corresponds approximately to the number of content words (the number of 1s) in 
a training pattern vector. 
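
A minimal sketch of the global inhibitor follows. It assumes that only the zero-valued off-diagonal weights are replaced and that the diagonal stays at zero; the function name and the parameter `mean_content_words` are hypothetical, with the default of 10 taken from the constant quoted above.

```python
def add_global_inhibitor(T, mean_content_words=10):
    """Return a copy of T with zero off-diagonal weights set to -mean_content_words / N."""
    N = len(T)
    inhibition = -mean_content_words / N
    return [
        [T[i][j] if (i == j or T[i][j] != 0) else inhibition for j in range(N)]
        for i in range(N)
    ]


# Toy weight matrix with two cooccurring word pairs (hypothetical values).
T = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 2],
     [0, 0, 2, 0]]
T_inhibited = add_global_inhibitor(T)
for row in T_inhibited:
    print(row)
```

With N = 4 the inhibitory weight is -2.5; for realistic lexicon sizes it is a small negative value spread over a very large number of connections.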

We can now formally define mean connectivity, c(T), as 

$$c(T) = N^{-1} \sum_{i}^{N} \sum_{j}^{N} g(T_{ij}) \qquad (7)$$

for 

$$g(x) = \begin{cases} 1 & \text{if } x \geq 1 \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

Finally we define matrix sparsity which shows the fraction of the weight matrix T which is 
non-zero 



$$s(T) = N^{-1} c(T) \qquad (9)$$
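
Both statistics are easy to compute from the weight matrix, as the sketch below shows. It assumes that a connection is counted when its weight is at least 1, that is, when the two words cooccur in at least one sentence, so that inhibitory entries are not counted; the function names and toy matrix are hypothetical.

```python
def mean_connectivity(T):
    """Mean number of words each word cooccurs with (entries with weight >= 1)."""
    N = len(T)
    return sum(1 for row in T for x in row if x >= 1) / N


def sparsity(T):
    """Fraction of the weight matrix counted as connected: c(T) / N."""
    return mean_connectivity(T) / len(T)


# Toy matrix: two cooccurring pairs, all other off-diagonal entries inhibitory.
T = [[0, 1, -2.5, -2.5],
     [1, 0, -2.5, -2.5],
     [-2.5, -2.5, 0, 2],
     [-2.5, -2.5, 2, 0]]

print(mean_connectivity(T), sparsity(T))
```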

In order to recall the stored patterns the initial output vector is set to a training pattern 
from $\xi^{\mu}$. The network is then updated stochastically, with units chosen in random order, until it settles into a stable 
state as shown by the convergence of the energy function 

$$E(\{V\}) = -\frac{1}{2} \sum_{i}^{N} \sum_{j}^{N} T_{ij} V_i V_j + \sum_{i}^{N} U_i V_i - \sum_{i}^{N} I_i V_i \qquad (10)$$

E({V}) has been proven by Hopfield to be a monotonically decreasing function of processing 
time and converges when the network has settled into a stable state. 
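
The recall procedure and its energy function can be sketched as below. This is a toy illustration under stated assumptions, not the simulation code used for the results: units are updated one at a time in a freshly shuffled order on each sweep, a unit keeps its output when its net input equals the threshold, and the weights, thresholds and names are all hypothetical.

```python
import random


def energy(T, V, U, I):
    """Energy of state V: pairwise term, threshold term and external input term."""
    N = len(V)
    pair = sum(T[i][j] * V[i] * V[j] for i in range(N) for j in range(N) if i != j)
    return (-0.5 * pair
            + sum(U[i] * V[i] for i in range(N))
            - sum(I[i] * V[i] for i in range(N)))


def recall(T, V, U, I, rng, max_sweeps=100):
    """Update units one at a time in random order until no unit changes."""
    V = list(V)
    N = len(V)
    for _ in range(max_sweeps):
        changed = False
        order = list(range(N))
        rng.shuffle(order)
        for i in order:
            n_i = sum(T[i][j] * V[j] for j in range(N) if j != i) + I[i]
            # Assumption: the output is unchanged when n_i equals the threshold.
            new = 1 if n_i > U[i] else 0 if n_i < U[i] else V[i]
            if new != V[i]:
                V[i], changed = new, True
        if not changed:
            break
    return V


# Toy weights: +1 inside each stored group of three units, -1 between groups.
groups = [0, 0, 0, 1, 1, 1]
T = [[0 if i == j else (1 if groups[i] == groups[j] else -1) for j in range(6)]
     for i in range(6)]
U = [0.5] * 6
I = [0.0] * 6
start = [1, 1, 0, 0, 0, 0]   # stored pattern [1, 1, 1, 0, 0, 0] with one 1 bit missing
final = recall(T, start, U, I, random.Random(0))
print(final, energy(T, start, U, I), energy(T, final, U, I))
```

In this example the missing bit is restored whatever the update order, and the energy of the final state is lower than that of the starting state.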



4 Limitations 

Numerical studies by Hopfield showed that the effective storage capacity of the network 
was linked to a storage ratio 

$$\alpha = n/N \qquad (11)$$

where n is the number of patterns and N is the number of bits in each pattern. Initial 
estimates for reliable storage showed that a value of $0.1 < \alpha < 0.2$ was most effective. 



Analytical techniques used by Amit et al. [3] have shown the existence of a critical value of 
$\alpha$ called $\alpha_c$ where auto-associative recall degrades discontinuously when $\alpha$ exceeds $\alpha_c$. Further 
studies by Grensing et al. showed that if we accept a small amount of error, say 0.005%, then 
$\alpha_c \approx 0.15$. 

Patterns can be stored and recalled in the Hopfield network because 
the patterns are made to correspond to stable states, called minima, in the energy landscape 
formed by the set of all possible output states of the network. When $\alpha < \alpha_c$ each stored pattern 
corresponds to a single stable state. As the critical value is exceeded multiple correspondences 
between patterns and stable states occur and spurious minima appear. Recall then rapidly 
becomes impossible. 

For our purposes we view the critical value as a serious limitation to the development of the 
Hopfield network for practical NLP. This is obvious when we consider that we are limited in 
the number of patterns we can store to $n < N\alpha_c$. 
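
To make the limitation concrete, the sketch below computes the storage ratio for the N and n values reported in Table 1 and, for comparison, the largest number of patterns allowed by the bound at a critical value of 0.14. The helper names are hypothetical and the choice of 0.14 is simply the figure quoted in the text for unbiased random patterns.

```python
import math

# (N, n) pairs as reported in Table 1 for the matrices T1 to T5.
table1 = {"T1": (260, 41), "T2": (673, 121), "T3": (921, 167),
          "T4": (1357, 273), "T5": (4131, 1412)}

ALPHA_C = 0.14  # critical value quoted in the text for unbiased random patterns


def storage_ratio(n, N):
    """Storage ratio: number of patterns divided by number of bits per pattern."""
    return n / N


def max_patterns(N, alpha_c=ALPHA_C):
    """Largest integer n with n < N * alpha_c."""
    return math.ceil(N * alpha_c) - 1


for name, (N, n) in table1.items():
    print(name, round(storage_ratio(n, N), 2), max_patterns(N))
```

For every matrix the number of stored sentences is larger than this bound would allow, which is why the storage ratios in Table 1 all lie above 0.14.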

The effect of bias on the critical value through correlated patterns such as those we propose 
to store is not clear and the literature apparently points in conflicting directions. Amit, 
for example, found that in the large N limit recall degraded discontinuously at $\alpha_c$ with random 
patterns. Researchers who have looked at biased patterns, such as Grensing et al., have found 
that recall degraded continuously when $\alpha$ exceeded $\alpha_c$ up to a second storage ratio value $\alpha_0$. 
Interestingly Amit et al. also found that bias induced a shift in the critical value from 0.14 
to 0.18. 

We aim in this paper to show through numerical simulations how bias in natural language 
sentences affects the critical value and the dynamical properties of the network. In particular 
we want to see (a) whether non-random biased patterns are stored as effectively as random 
unbiased ones, and (b) if storage takes place, whether we observe a discontinuity and a critical 
value at $\alpha_c = 0.14$. We also want to see if we can find some causal relation between bias and 
the critical storage value. 



5 The Training Corpus 

The training vectors, $\xi^{\mu}$, are derived from the Asahi corpus of newspaper editorials. A full 
specification for this corpus is given by Collier et al., parts of which have been repeated 
here for completeness. We use this parallel corpus because it is convenient for our next stage of 
work which will look at lexical transfer, a form of word sense disambiguation, from English to 
Japanese. This need not concern us here except in so far as the characteristics of the sentences 
in the corpus affect the storage results. We therefore present a short outline of the features of 
the corpus. 

The corpus is in English and Japanese and has been aligned at the sentence level. In our 
experiments we only use the English sentences. Moreover, as a pre-processing stage we remove 
all of the function words because they do not contribute significantly to the context of the 
sentence. Of the 330,000 English words in the corpus, approximately 39.7% are function words. 

The mean length of the sentences which are left is approximately 10 content words and we 
have calculated that the mean number of cooccurrences between word pairs is 116 . This means 
that the training patterns are very sparse and so is the resulting weight matrix T. 

Another influence on the network performance which comes from the training data is the 
range of values which the storage ratio, $\alpha$, takes. This can be calculated from the lexicon 
closure curve for the corpus. The lexicon growth curve has been found to closely match the 
curve $F(n) = 110\, n^{1/2}$. 



If we take the number of sentences as n and the lexicon size as F(n) then we find that 
as n approaches 12000, $\alpha \approx n/F(n) \to 1.0$, which is clearly above the values given for $\alpha_c$ 
and indicates that storage of such sentences is impossible. For subcorpora extracted from these 
12000 sentences we have generally found that $0.14 < \alpha < 1.0$. 
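
The growth of the storage ratio with corpus size can be checked with a short calculation. The sketch below takes the lexicon growth curve to be 110 times the square root of n; the exponent of 0.5 is treated here as an assumption, chosen because it makes n/F(n) approach 1.0 as n approaches 12000, and the function names are hypothetical.

```python
def lexicon_size(n, scale=110.0, exponent=0.5):
    """Assumed growth curve F(n) = scale * n**exponent (the exponent is an assumption)."""
    return scale * n ** exponent


def storage_ratio(n):
    """Storage ratio when the lexicon size follows the assumed growth curve."""
    return n / lexicon_size(n)


for n in (1412, 12000):
    print(n, round(lexicon_size(n)), round(storage_ratio(n), 3))
```

Under this assumption the curve predicts a lexicon of about 4133 words for 1412 sentences, close to the N = 4131 listed for T5 in Table 1, and a storage ratio just under 1.0 for 12000 sentences.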



6 Results 



In order to test the effectiveness of storage for a particular pattern $\xi^{\mu}$ using numerical 
simulations we can measure the fractional Hamming distance between an actual stable state, $V^{\mu'}$, and 
the nominated stable state, $V^{\mu}$. This is defined as 

$$D^{\mu} = \frac{1}{N} \sum_{i}^{N} \left| V_i^{\mu} - V_i^{\mu'} \right| \qquad (12)$$

Clearly the storage prescription will be effective for a system with $V_i \in \{0, 1\}$ and large 
N according to how $D^{\mu}$ is distributed about D=0 over a large number of trials. The following 
measure of recall effectiveness is derived from Bruce et al. 

$$F_b = \frac{1}{n} \sum_{\mu}^{n} D^{\mu} = \bar{D} \qquad (13)$$

and shows the mean fraction of bits in the nominated images $V^{\mu}$ which fail to coincide with their 
corresponding stable images in $V^{\mu'}$, i.e. the fraction of bits which are recalled in error over 
a number of trials. The storage prescription is effective to the extent that $F_b$ is less than 0.5, 
the value at which there is no coherent overlap between $V^{\mu}$ and $V^{\mu'}$. 
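
These error measures can be sketched as follows. The code assumes the fractional Hamming distance is the fraction of differing bits and that the recall measure is its mean over a set of trials; the restriction to 1 bits or 0 bits mirrors the way the results are split between the two figures. All names and the toy vectors are hypothetical.

```python
def fractional_hamming(nominated, recalled):
    """Fraction of positions at which the two 0/1 vectors differ."""
    return sum(a != b for a, b in zip(nominated, recalled)) / len(nominated)


def error_on_bits(nominated, recalled, bit):
    """Error fraction over positions where the nominated bit equals `bit`.

    Assumes the nominated vector contains at least one such position.
    """
    positions = [i for i, a in enumerate(nominated) if a == bit]
    wrong = sum(nominated[i] != recalled[i] for i in positions)
    return wrong / len(positions)


def mean_error(pairs):
    """Mean fractional Hamming distance over (nominated, recalled) pairs."""
    return sum(fractional_hamming(a, b) for a, b in pairs) / len(pairs)


nominated = [1, 1, 0, 0, 0]
recalled = [1, 0, 0, 1, 0]   # one 1 bit lost, one 0 bit wrongly set

print(fractional_hamming(nominated, recalled))
print(error_on_bits(nominated, recalled, 1), error_on_bits(nominated, recalled, 0))
print(mean_error([(nominated, recalled), (nominated, nominated)]))
```

Splitting the error by bit value matters for sparse patterns: a network that outputs all zeros would score a small overall distance while failing on every 1 bit.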

To test the effectiveness of storage, auto-association tests were done for five test matrices 
T1 to T5 with specifications given in Table 1. Auto-association involves presenting the network 
with a noisy version of a stored pattern and measuring the error in recall according to Eqn. 
(13). The matrices T1 to T5 represent a range of sizes of subcorpora taken from the Asahi 
corpus and are expected to show the effects of scale on the Hopfield network. 

We may note that since the result of each presentation of a pattern to the model is non- 
deterministic and we conducted a large number of independent trials we can consider the tests 
as Monte Carlo simulations. 

We see from the scores for sparsity and connectivity in Table 1 that only a small fraction of 
the weight matrix is non-zero. This confirms that natural language patterns and the matrices 
they form are in the class of low activation networks. Moreover, $\alpha$ exceeds the expected level 
of $\alpha_c \approx 0.14$ in all matrices. 

Due to the greater relevance of recalling a $\xi_i^{\mu} = 1$ bit over a $\xi_i^{\mu} = 0$ bit in the nominated 
image the results are shown in two figures. Figure 1 shows $F_b$ for bits in the nominated image 
which should be set to 1. Figure 2 shows $F_b$ for $\xi_i^{\mu} = 0$ bits. 

The mean fraction of randomly flipped bits in the nominated image at the start of 
processing is shown as $m_0$. In all of the simulations except for T5, the mean was calculated 
from 10 trials of 50 patterns. For T5, 12 trials of 40 patterns were used. 
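
The preparation of a noisy starting pattern can be sketched as below. The sketch assumes that a fixed number of bits, the nearest integer to the noise fraction times N, is flipped at randomly chosen positions; flipping each bit independently with that probability would be an equally plausible reading. The function name `add_noise` and the toy pattern are hypothetical.

```python
import random


def add_noise(pattern, m0, rng):
    """Return a copy of pattern with round(m0 * N) randomly chosen bits flipped."""
    N = len(pattern)
    flipped = set(rng.sample(range(N), round(m0 * N)))
    return [1 - b if i in flipped else b for i, b in enumerate(pattern)]


rng = random.Random(0)
pattern = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
noisy = add_noise(pattern, 0.2, rng)
print(noisy, sum(a != b for a, b in zip(pattern, noisy)))
```

A noise fraction of 0.0 leaves the pattern unchanged, which corresponds to the noise-free starting condition.
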

Figure 1. Over generation: Mean fraction of error in recalled patterns $F_b$ against initial pattern noise 
$m_0$ for 1 bits. T1: ×, T2: ○, T3: □, T4: +, T5: △. 

Figure 2. Over generation: Mean fraction of error in recalled patterns $F_b$ against initial pattern noise 
$m_0$ for 0 bits. T1: ×, T2: ○, T3: □, T4: +, T5: △. 



7 Conclusion 



Since the evaluation approach we use is numerical rather than analytical and we cannot conduct 
exhaustive tests we must regard the calculations as approximations as far as the storage capacity 
of the network is concerned. However, we have extrapolated our results over a large number of 
tests so we are quite confident of at least 2 decimal places of accuracy in the results with an 
overall error of 0.01 in the mean values for $F_b$. 

Despite the low activation levels shown by sparsity and connectivity in Table 1, the sim- 
ulations showed that patterns of natural language sentences could successfully be stored and 
retrieved from the Hopfield network. No sudden discontinuity in $F_b$ was observed either for 1 
bits or 0 bits. 

Overall recall was good for T1 to T4 with a poorer resistance to initial noise by T5. Indeed 
we note that even in an absence of noise, i.e. when $m_0 = 0.0$, the error in recall of 1 bits for T5 
was greater than 0. On close inspection we also see that T4 has a small error at $m_0 = 0.0$. From 
this evidence we conclude that $\alpha$ for T4 and T5 is above the critical level $\alpha_c$, which gives us a 
value of $0.18 < \alpha_c < 0.20$, in line with Grensing's findings for biased random patterns. 
This indicates that recall degrades continuously after $\alpha$ exceeds $\alpha_c$ up to some point when recall 
totally fails. In our simulations we have not reached a point of total recall failure. 

We should also note that although in general looking at $F_b$ is not a good way of detecting 
discontinuities in $\alpha$ we think that in our simulations N is sufficiently large to validate the 
method. 

In line with comments by other researchers we note that one reason for successful 
recall of patterns in low activity networks is the degree of correlation between patterns. This is 
despite the values of $\alpha$ exceeding the expected critical value $\alpha_c$. Significant correlations between 
stored patterns could be said to have interacted with bias to increase the critical storage value. 

In this paper we have shown that biased patterns which are correlated by an underlying 
complex linguistic distribution of word cooccurrences can be stored and recalled in a Hopfield 
network. Moreover, the network behaves differently to one trained with unbiased random pat- 
terns in that the critical storage ratio is increased from the theoretical limit and recall degrades 
continuously. This establishes two basic properties of the Hopfield network for NLP. Once the 
fundamental behaviour of the Hopfield network is known we can then usefully adapt it to 
association based NLP. 

Matrix                      T1      T2      T3      T4      T5
N                          260     673     921    1357    4131
n                           41     121     167     273    1412
$\alpha$                  0.16    0.18    0.18    0.20    0.34
s(T)                     0.062   0.028   0.027   0.021   0.011
c(T)                     15.95   18.84   24.87   28.90   44.50
$\Pr(\xi_i^{\mu} = 1)$   0.037   0.014   0.011   0.008   0.003

Table 1. Training matrix characteristics 



